Introduction (ISL 1)

Biostat 274, Statistical Learning

Author

Dr. Jin Zhou @ UCLA

Published

January 3, 2026

Credit: This note draws heavily on material from the books An Introduction to Statistical Learning: with Applications in R (ISL2) and The Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL2).

Display system information for reproducibility.

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.7.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.2    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.2       htmltools_0.5.8.1 yaml_2.3.10       rmarkdown_2.29   
 [9] knitr_1.50        jsonlite_2.0.0    xfun_0.53         digest_0.6.37    
[13] rlang_1.1.6       evaluate_1.0.5   
import IPython
print(IPython.sys_info())
{'commit_hash': '5ed988a91',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Users/jinjinzhou/.virtualenvs/r-tensorflow/lib/python3.10/site-packages/IPython',
 'ipython_version': '8.33.0',
 'os_name': 'posix',
 'platform': 'macOS-15.7.3-arm64-arm-64bit',
 'sys_executable': '/Users/jinjinzhou/.virtualenvs/r-tensorflow/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.10.16 (main, Mar  3 2025, 20:01:33) [Clang 16.0.0 '
                '(clang-1600.0.26.6)]'}

1 Overview of statistical/machine learning

In this class, we use the phrases statistical learning, machine learning, or simply learning interchangeably.

1.1 Supervised vs unsupervised learning

  • Supervised learning: input(s) -> output.
    • Prediction (regression): the output is continuous (income, weight, BMI, …).
    • Classification: the output is categorical (disease or not, pattern recognition, …).
  • Unsupervised learning: no output. We learn relationships and structure in the data.
    • Clustering.
    • Dimension reduction.
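The distinction can be sketched in a few lines of Python (toy, made-up data; not one of the course data sets): supervised learning fits a map from inputs to an observed output, while unsupervised learning looks for structure in the inputs alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised: inputs x AND outputs y are observed; fit y ~ a + b*x.
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)
b, a = np.polyfit(x, y, deg=1)  # returns (slope, intercept)

# Unsupervised: only inputs are observed; look for structure,
# here two clusters found by a tiny 1-D Lloyd's (k-means) loop.
z = np.concatenate([rng.normal(0, 1, 100), rng.normal(8, 1, 100)])
centers = np.array([z.min(), z.max()])  # crude initialization
for _ in range(10):
    labels = np.abs(z[:, None] - centers).argmin(axis=1)
    centers = np.array([z[labels == k].mean() for k in range(2)])

print(a, b)              # close to the true intercept 2 and slope 3
print(np.sort(centers))  # close to the true cluster centers 0 and 8
```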

1.2 Supervised learning

  • Predictors \[ X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}. \] Also called inputs, covariates, regressors, features, independent variables.

  • Outcome \(Y\) (also called output, response variable, dependent variable, target).

    • In the regression problem, \(Y\) is quantitative (price, weight, bmi).
    • In the classification problem, \(Y\) is categorical. That is, \(Y\) takes values in a finite, unordered set (survived/died, customer buys a product or not, digit 0-9, object in image, cancer class of tissue sample).
  • We have training data \((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\). These are observations (also called samples, instances, cases). Training data is often represented by a predictor matrix \[ \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} \tag{1}\]
    and a response vector \[ \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \]

  • Based on the training data, our goal is to

    • Accurately predict the unseen outcomes of test cases based on their predictors.
    • Understand which predictors affect the outcome, and how.
    • Assess the quality of our predictions and inferences.
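In code, the training data is just an \(n \times p\) array together with a length-\(n\) response vector; a minimal NumPy illustration with made-up numbers (rows of \(\mathbf{X}\) are observations \(\mathbf{x}_i^T\), columns are variables):

```python
import numpy as np

# n = 4 observations, p = 3 predictors; each row is one observation x_i^T
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 2.9, 4.3],
              [5.9, 3.0, 5.1]])
y = np.array([0.2, 0.2, 1.3, 1.8])  # one response y_i per row of X

n, p = X.shape
x_1 = X[0, :]    # first observation (a length-p vector)
var_2 = X[:, 1]  # second variable across all n observations

print(n, p)      # 4 3
print(x_1)       # [5.1 3.5 1.4]
```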

1.2.1 Example: salary

  • The Wage data set contains wage and other data for a group of 3,000 male workers in the Mid-Atlantic region in 2003-2009.

  • Our goal is to establish the relationship between salary and demographic variables in population survey data.

  • Since wage is a quantitative variable, it is a regression problem.

library(gtsummary)
library(ISLR2)
library(tidyverse)

# Convert to tibble
Wage <- as_tibble(Wage) %>% print(width = Inf)
# A tibble: 3,000 × 11
    year   age maritl           race     education       region            
   <int> <int> <fct>            <fct>    <fct>           <fct>             
 1  2006    18 1. Never Married 1. White 1. < HS Grad    2. Middle Atlantic
 2  2004    24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
 3  2003    45 2. Married       1. White 3. Some College 2. Middle Atlantic
 4  2003    43 2. Married       3. Asian 4. College Grad 2. Middle Atlantic
 5  2005    50 4. Divorced      1. White 2. HS Grad      2. Middle Atlantic
 6  2008    54 2. Married       1. White 4. College Grad 2. Middle Atlantic
 7  2009    44 2. Married       4. Other 3. Some College 2. Middle Atlantic
 8  2008    30 1. Never Married 3. Asian 3. Some College 2. Middle Atlantic
 9  2006    41 1. Never Married 2. Black 3. Some College 2. Middle Atlantic
10  2004    52 2. Married       1. White 2. HS Grad      2. Middle Atlantic
   jobclass       health         health_ins logwage  wage
   <fct>          <fct>          <fct>        <dbl> <dbl>
 1 1. Industrial  1. <=Good      2. No         4.32  75.0
 2 2. Information 2. >=Very Good 2. No         4.26  70.5
 3 1. Industrial  1. <=Good      1. Yes        4.88 131. 
 4 2. Information 2. >=Very Good 1. Yes        5.04 155. 
 5 2. Information 1. <=Good      1. Yes        4.32  75.0
 6 2. Information 2. >=Very Good 1. Yes        4.85 127. 
 7 1. Industrial  2. >=Very Good 1. Yes        5.13 170. 
 8 2. Information 1. <=Good      1. Yes        4.72 112. 
 9 2. Information 2. >=Very Good 1. Yes        4.78 119. 
10 2. Information 2. >=Very Good 1. Yes        4.86 129. 
# ℹ 2,990 more rows
# Summary statistics
Wage %>% tbl_summary()
Characteristic N = 3,000¹
year
    2003 513 (17%)
    2004 485 (16%)
    2005 447 (15%)
    2006 392 (13%)
    2007 386 (13%)
    2008 388 (13%)
    2009 389 (13%)
age 42 (34, 51)
maritl
    1. Never Married 648 (22%)
    2. Married 2,074 (69%)
    3. Widowed 19 (0.6%)
    4. Divorced 204 (6.8%)
    5. Separated 55 (1.8%)
race
    1. White 2,480 (83%)
    2. Black 293 (9.8%)
    3. Asian 190 (6.3%)
    4. Other 37 (1.2%)
education
    1. < HS Grad 268 (8.9%)
    2. HS Grad 971 (32%)
    3. Some College 650 (22%)
    4. College Grad 685 (23%)
    5. Advanced Degree 426 (14%)
region
    1. New England 0 (0%)
    2. Middle Atlantic 3,000 (100%)
    3. East North Central 0 (0%)
    4. West North Central 0 (0%)
    5. South Atlantic 0 (0%)
    6. East South Central 0 (0%)
    7. West South Central 0 (0%)
    8. Mountain 0 (0%)
    9. Pacific 0 (0%)
jobclass
    1. Industrial 1,544 (51%)
    2. Information 1,456 (49%)
health
    1. <=Good 858 (29%)
    2. >=Very Good 2,142 (71%)
health_ins
    1. Yes 2,083 (69%)
    2. No 917 (31%)
logwage 4.65 (4.45, 4.86)
wage 105 (85, 129)
¹ n (%); Median (Q1, Q3)
# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point() + 
  geom_smooth() +
  labs(title = "Wage changes nonlinearly with age",
       x = "Age",
       y = "Wage (k$)")

# Plot wage ~ year
Wage %>%
  ggplot(mapping = aes(x = year, y = wage)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(title = "Average wage increases by $10k in 2003-2009",
       x = "Year",
       y = "Wage (k$)")

# Plot wage ~ education
Wage %>%
  ggplot(mapping = aes(x = education, y = wage)) + 
  geom_point() + 
  geom_boxplot() +
  labs(title = "Wage increases with education level",
       x = "Education",
       y = "Wage (k$)")

Summary statistics:

# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font size in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import Wage data
Wage = pd.read_csv(
  "./slides/data/Wage.csv",
  dtype =  {
    'maritl':'category', 
    'race':'category',
    'education':'category',
    'region':'category',
    'jobclass':'category',
    'health':'category',
    'health_ins':'category'
    }
  )
Wage
      year  age            maritl      race        education  \
0     2006   18  1. Never Married  1. White     1. < HS Grad   
1     2004   24  1. Never Married  1. White  4. College Grad   
2     2003   45        2. Married  1. White  3. Some College   
3     2003   43        2. Married  3. Asian  4. College Grad   
4     2005   50       4. Divorced  1. White       2. HS Grad   
...    ...  ...               ...       ...              ...   
2995  2008   44        2. Married  1. White  3. Some College   
2996  2007   30        2. Married  1. White       2. HS Grad   
2997  2005   27        2. Married  2. Black     1. < HS Grad   
2998  2005   27  1. Never Married  1. White  3. Some College   
2999  2009   55      5. Separated  1. White       2. HS Grad   

                  region        jobclass          health health_ins   logwage  \
0     2. Middle Atlantic   1. Industrial       1. <=Good      2. No  4.318063   
1     2. Middle Atlantic  2. Information  2. >=Very Good      2. No  4.255273   
2     2. Middle Atlantic   1. Industrial       1. <=Good     1. Yes  4.875061   
3     2. Middle Atlantic  2. Information  2. >=Very Good     1. Yes  5.041393   
4     2. Middle Atlantic  2. Information       1. <=Good     1. Yes  4.318063   
...                  ...             ...             ...        ...       ...   
2995  2. Middle Atlantic   1. Industrial  2. >=Very Good     1. Yes  5.041393   
2996  2. Middle Atlantic   1. Industrial  2. >=Very Good      2. No  4.602060   
2997  2. Middle Atlantic   1. Industrial       1. <=Good      2. No  4.193125   
2998  2. Middle Atlantic   1. Industrial  2. >=Very Good     1. Yes  4.477121   
2999  2. Middle Atlantic   1. Industrial       1. <=Good     1. Yes  4.505150   

            wage  
0      75.043154  
1      70.476020  
2     130.982177  
3     154.685293  
4      75.043154  
...          ...  
2995  154.685293  
2996   99.689464  
2997   66.229408  
2998   87.981033  
2999   90.481913  

[3000 rows x 11 columns]
Wage.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   year        3000 non-null   int64   
 1   age         3000 non-null   int64   
 2   maritl      3000 non-null   category
 3   race        3000 non-null   category
 4   education   3000 non-null   category
 5   region      3000 non-null   category
 6   jobclass    3000 non-null   category
 7   health      3000 non-null   category
 8   health_ins  3000 non-null   category
 9   logwage     3000 non-null   float64 
 10  wage        3000 non-null   float64 
dtypes: category(7), float64(2), int64(2)
memory usage: 115.2 KB
# summary statistics
Wage.describe(include = "all")
# Plot wage ~ age
sns.lmplot(
  data = Wage, 
  x = "age", 
  y = "wage", 
  lowess = True,
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Age', 
  ylabel = 'Wage (k$)'
  )
Figure 1: Wage changes nonlinearly with age.
# Plot wage ~ year
sns.lmplot(
  data = Wage, 
  x = "year", 
  y = "wage", 
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Year', 
  ylabel = 'Wage (k$)'
  )
Figure 3: Average wage increases by $10k in 2003-2009.
# Plot wage ~ education
ax = sns.boxplot(
  data = Wage, 
  x = "education", 
  y = "wage"
  )
ax.set(
  xlabel = 'Education', 
  ylabel = 'Wage (k$)'
  )
ax.set_xticklabels(ax.get_xticklabels(), rotation = 15)
Figure 5: Wage increases with education level.
# Plot wage ~ race
ax = sns.boxplot(
  data = Wage, 
  x = "race", 
  y = "wage"
  )
ax.set(
  xlabel = 'Race', 
  ylabel = 'Wage (k$)'
  )
ax.set_xticklabels(ax.get_xticklabels(), rotation = 15)
Figure 6: Any income inequality?

1.2.2 Example: stock market

library(quantmod)

SP500 <- getSymbols(
  "^GSPC", 
  src = "yahoo", 
  auto.assign = FALSE, 
  from = "2022-01-01",
  to = "2022-12-31")

chartSeries(SP500, theme = chartTheme("white"),
            type = "line", log.scale = FALSE, TA = NULL)

  • The Smarket data set contains daily percentage returns for the S&P 500 stock index between 2001 and 2005.

  • Our goal is to predict whether the index will increase or decrease on a given day, using the past 5 days’ percentage changes in the index.

  • Since the outcome is binary (increase or decrease), it is a classification problem.

  • From the boxplots in Figure 7, it seems that the previous 5 days' percentage returns do not discriminate whether today’s return is positive or negative.

# Data information
help(Smarket)

# Convert to tibble
Smarket <- as_tibble(Smarket) %>% print(width = Inf)
# A tibble: 1,250 × 9
    Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <fct>    
 1  2001  0.381 -0.192 -2.62  -1.06   5.01    1.19  0.959 Up       
 2  2001  0.959  0.381 -0.192 -2.62  -1.06    1.30  1.03  Up       
 3  2001  1.03   0.959  0.381 -0.192 -2.62    1.41 -0.623 Down     
 4  2001 -0.623  1.03   0.959  0.381 -0.192   1.28  0.614 Up       
 5  2001  0.614 -0.623  1.03   0.959  0.381   1.21  0.213 Up       
 6  2001  0.213  0.614 -0.623  1.03   0.959   1.35  1.39  Up       
 7  2001  1.39   0.213  0.614 -0.623  1.03    1.44 -0.403 Down     
 8  2001 -0.403  1.39   0.213  0.614 -0.623   1.41  0.027 Up       
 9  2001  0.027 -0.403  1.39   0.213  0.614   1.16  1.30  Up       
10  2001  1.30   0.027 -0.403  1.39   0.213   1.23  0.287 Up       
# ℹ 1,240 more rows
# Summary statistics
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
           
           
           
           
# Plot Direction ~ Lag1, Direction ~ Lag2, ...
Smarket %>%
  pivot_longer(cols = Lag1:Lag5, names_to = "Lag", values_to = "Perc") %>%
  ggplot() + 
  geom_boxplot(mapping = aes(x = Direction, y = Perc)) +
  labs(
    x = "Today's Direction", 
    y = "Percentage change in S&P",
    title = "Up and down of S&P doesn't depend on previous days' percentage changes."
    ) +
  facet_wrap(~ Lag)
Figure 7: LagX is the percentage return for the previous X days.
# Import S&P500 data
Smarket = pd.read_csv("./slides/data/Smarket.csv")
Smarket
      Year   Lag1   Lag2   Lag3   Lag4   Lag5   Volume  Today Direction
0     2001  0.381 -0.192 -2.624 -1.055  5.010  1.19130  0.959        Up
1     2001  0.959  0.381 -0.192 -2.624 -1.055  1.29650  1.032        Up
2     2001  1.032  0.959  0.381 -0.192 -2.624  1.41120 -0.623      Down
3     2001 -0.623  1.032  0.959  0.381 -0.192  1.27600  0.614        Up
4     2001  0.614 -0.623  1.032  0.959  0.381  1.20570  0.213        Up
...    ...    ...    ...    ...    ...    ...      ...    ...       ...
1245  2005  0.422  0.252 -0.024 -0.584 -0.285  1.88850  0.043        Up
1246  2005  0.043  0.422  0.252 -0.024 -0.584  1.28581 -0.955      Down
1247  2005 -0.955  0.043  0.422  0.252 -0.024  1.54047  0.130        Up
1248  2005  0.130 -0.955  0.043  0.422  0.252  1.42236 -0.298      Down
1249  2005 -0.298  0.130 -0.955  0.043  0.422  1.38254 -0.489      Down

[1250 rows x 9 columns]
Smarket.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Year       1250 non-null   int64  
 1   Lag1       1250 non-null   float64
 2   Lag2       1250 non-null   float64
 3   Lag3       1250 non-null   float64
 4   Lag4       1250 non-null   float64
 5   Lag5       1250 non-null   float64
 6   Volume     1250 non-null   float64
 7   Today      1250 non-null   float64
 8   Direction  1250 non-null   object 
dtypes: float64(7), int64(1), object(1)
memory usage: 88.0+ KB
# summary statistics
Smarket.describe(include = "all")
# Pivot to long format for facet plotting
Smarket_long = pd.melt(
  Smarket, 
  id_vars = ['Year', 'Volume', 'Today', 'Direction'], 
  value_vars = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5'],
  var_name = 'Lag',
  value_name = 'Perc'
  )
Smarket_long  
      Year   Volume  Today Direction   Lag   Perc
0     2001  1.19130  0.959        Up  Lag1  0.381
1     2001  1.29650  1.032        Up  Lag1  0.959
2     2001  1.41120 -0.623      Down  Lag1  1.032
3     2001  1.27600  0.614        Up  Lag1 -0.623
4     2001  1.20570  0.213        Up  Lag1  0.614
...    ...      ...    ...       ...   ...    ...
6245  2005  1.88850  0.043        Up  Lag5 -0.285
6246  2005  1.28581 -0.955      Down  Lag5 -0.584
6247  2005  1.54047  0.130        Up  Lag5 -0.024
6248  2005  1.42236 -0.298      Down  Lag5  0.252
6249  2005  1.38254 -0.489      Down  Lag5  0.422

[6250 rows x 6 columns]
g = sns.FacetGrid(Smarket_long, col = "Lag", col_wrap = 3, height = 10)
g.map_dataframe(sns.boxplot, x = "Direction", y = "Perc")

plt.clf()

1.2.3 Real Example (1)

Development and validation of a bronchoalveolar lavage genomic classifier for acute cellular rejection. EBioMedicine. 2025 Dec;122:106046. doi: 10.1016/j.ebiom.2025.106046.

  • Lung transplant recipients are at risk for acute cellular rejection (ACR), which is a major cause of morbidity and mortality.

  • Genomic classifier

  • RNA Seq data (Transcriptome) is a high-dimensional data set.

  • 183 lung transplant recipients with 58,735 gene transcripts’ expression levels measured.

1.2.4 Real Example (2)

Dynamically Predicting Renal Complications After Development of Diabetes for Millions Across Biobanks, Transportability and Transferability

  • Diabetes is a major cause of kidney disease, which can lead to kidney failure and death.
  • After developing diabetes, the risk of kidney disease varies across individuals.
  • Goal: Develop a prediction model for kidney disease after developing diabetes.
  • EHR and Biobanks provide a rich source of data for developing prediction models.

1.2.5 Example: handwritten digit recognition

Figure 8: Examples of handwritten digits from the MNIST corpus (ISL Figure 10.3).
  • Input: 784 pixel values from \(28 \times 28\) grayscale images. Output: digit 0, 1, …, 9; a 10-class classification problem.

  • On the MNIST data set (60,000 training images, 10,000 testing images), the error rates of the following methods have been reported:

    Method                                               Error rate
    tangent distance with 1-nearest-neighbor classifier  1.1%
    degree-9 polynomial SVM                              0.8%
    LeNet-5                                              0.8%
    boosted LeNet-4                                      0.7%
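The 1-nearest-neighbor idea behind the first entry of the table can be sketched in a few lines of NumPy. This toy version uses plain Euclidean distance on made-up 2-D points rather than tangent distance on the 784-dimensional MNIST pixels:

```python
import numpy as np

def knn1_predict(X_train, y_train, X_test):
    """1-nearest-neighbor: each test point gets the label of its
    closest training point under Euclidean distance."""
    # Pairwise squared distances, shape (n_test, n_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

# Toy 2-class data standing in for the 784-dimensional pixel vectors
X_train = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [6.0, 4.5]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.2], [5.5, 5.0]])

print(knn1_predict(X_train, y_train, X_test))  # [0 1]
```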

1.2.6 Example: more computer vision tasks

Some popular data sets from computer vision.

1.2.7 Example: classify the pixels in a satellite image, by usage

Figure 9: LANDSAT images (ESL Figure 13.6).
  • LANDSAT: 82x100 pixels. Four heat-map images, two in the visible spectrum and two in the infrared, for an area of agricultural land in Australia.

  • Each pixel has a class label from the 7-element set {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil}, determined manually by research assistants surveying the area. The objective is to classify the land usage at a pixel, based on the information in the four spectral bands.

1.3 Unsupervised learning

  • No outcome variable, just predictors.

  • The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation, or fit generative models (e.g., transformers).

  • Difficult to know how well you are doing.

  • Can be useful in exploratory data analysis (EDA) or as a pre-processing step for supervised learning.

1.3.1 Example: gene expression

  • The NCI60 data set consists of 6,830 gene expression measurements for each of 64 cancer cell lines.
# NCI60 data and cancer labels
str(NCI60)
List of 2
 $ data: num [1:64, 1:6830] 0.3 0.68 0.94 0.28 0.485 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:64] "V1" "V2" "V3" "V4" ...
  .. ..$ : chr [1:6830] "1" "2" "3" "4" ...
 $ labs: chr [1:64] "CNS" "CNS" "CNS" "RENAL" ...
# Cancer type of each cell line
table(NCI60$labs)

     BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
          7           5           7           1           1           6 
MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
          1           1           8           9           6           2 
      RENAL     UNKNOWN 
          9           1 
# Apply PCA using prcomp function
# Need to scale / Normalize as
# PCA depends on distance measure
prcomp(NCI60$data, scale = TRUE, center = TRUE, retx = TRUE)$x %>%
  as_tibble() %>%
  add_column(cancer_type = NCI60$labs) %>%
  # Plot PC2 vs PC1
  ggplot() + 
  geom_point(mapping = aes(x = PC1, y = PC2, color = cancer_type)) +
  labs(title = "Gene expression profiles cluster according to cancer types")

# Import NCI60 data
nci60_data = pd.read_csv('./slides/data/NCI60_data.csv')
nci60_labs = pd.read_csv('./slides/data/NCI60_labs.csv')
nci60_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Columns: 6830 entries, 1 to 6830
dtypes: float64(6830)
memory usage: 3.3 MB
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Obtain the first 2 principal components
nci60_tr = scale(nci60_data, with_mean = True, with_std = True)
nci60_pc = pd.DataFrame(
  PCA(n_components = 2).fit(nci60_tr).transform(nci60_tr),
  columns = ['PC1', 'PC2']
  )
nci60_pc['PC2'] *= -1  # for easier comparison with R
nci60_pc['cancer_type'] = nci60_labs
nci60_pc
          PC1        PC2 cancer_type
0   19.838024   3.556173         CNS
1   23.089165   6.445664         CNS
2   27.456083   2.466115         CNS
3   42.816790  -9.768247       RENAL
4   55.418510  -5.198767      BREAST
..        ...        ...         ...
59  17.996230 -47.241945    MELANOMA
60   4.415524 -42.311026    MELANOMA
61  22.967008 -36.104937    MELANOMA
62  19.176029 -50.401119    MELANOMA
63  13.232931 -35.129659    MELANOMA

[64 rows x 3 columns]
# Plot PC2 vs PC1
sns.relplot(
  kind = 'scatter', 
  data = nci60_pc, 
  x = 'PC1',
  y = 'PC2',
  hue = 'cancer_type',
  height = 10
  )

1.3.2 Example: mapping people from their genomes

  • The genetic makeup of \(n\) individuals can be represented by a matrix as in Equation 1, where \(x_{ij} \in \{0, 1, 2\}\) is the \(j\)-th genetic marker of the \(i\)-th individual.

    Is it possible to visualize the geographic relationship of these individuals?

  • The following picture is from the article Genes mirror geography within Europe by Novembre et al. (2008), published in Nature.
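A minimal sketch of the idea with a made-up genotype matrix containing two artificial populations: center the columns, take an SVD, and the leading principal component recovers the population structure. (The population sizes, allele frequencies, and marker count below are assumptions for illustration only.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up genotypes: 2 "populations" with different allele frequencies
n, p = 40, 200
freqs = np.vstack([np.full(p, 0.2), np.full(p, 0.8)])  # per-population frequency
pop = np.repeat([0, 1], n // 2)                        # population labels
G = rng.binomial(2, freqs[pop])                        # n x p matrix of 0/1/2 values

# PCA via SVD of the column-centered matrix
Gc = G - G.mean(axis=0)
U, s, Vt = np.linalg.svd(Gc, full_matrices=False)
pc1 = U[:, 0] * s[0]  # first principal component scores

# PC1 separates the two populations: group means have opposite signs
print(pc1[pop == 0].mean() * pc1[pop == 1].mean() < 0)  # True
```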

1.3.3 Ancestry estimation

Figure 10: Unsupervised discovery of ancestry-informative markers and genetic admixture proportions. Paper.

1.4 No easy answer

In modern applications, the line between supervised and unsupervised learning is blurred.

1.4.1 Example: the Netflix prize

Figure 11: The Netflix challenge.
  • Competition started in Oct 2006. Training data is ratings for 480,189 Netflix customers \(\times\) 17,770 movies, each rating between 1 and 5.

  • The training data are very sparse: about 98% of the entries are missing.

  • The objective is to predict the rating for a set of 1 million customer-movie pairs that are missing in the training data.

  • Netflix’s in-house algorithm achieved a root mean squared error (RMSE) of 0.953. The first team to achieve a 10% improvement wins one million dollars.

  • Is this a supervised or unsupervised problem?

    • We can treat rating as outcome and user-movie combinations as predictors. Then it is a supervised learning problem.

    • Or we can treat it as a matrix factorization or low-rank approximation problem. Then it is more of an unsupervised learning problem, similar to PCA.
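The low-rank view can be sketched on a toy, fully observed ratings matrix (the real Netflix matrix is mostly missing, which calls for methods such as alternating least squares rather than a plain SVD; the factors below are made up):

```python
import numpy as np

# Toy 4-users x 5-movies ratings built from 2 latent "taste" factors
U_true = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
V_true = np.array([[5., 4., 5., 1., 1.],   # factor 1: likes action movies
                   [1., 1., 2., 5., 4.]])  # factor 2: likes dramas
R = U_true @ V_true                        # rank-2 ratings matrix

# Best rank-2 approximation via truncated SVD (Eckart-Young theorem)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]

print(np.allclose(R, R2))  # True: R is exactly rank 2
```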

1.4.2 Example: large language models (LLMs)

Modern large language models, such as ChatGPT, combine both supervised learning and reinforcement learning.

1.5 Statistical learning vs machine learning

  • Machine learning arose as a subfield of Artificial Intelligence.

  • Statistical learning arose as a subfield of Statistics.

  • There is much overlap. Both fields focus on supervised and unsupervised problems.

    • Machine learning has a greater emphasis on large scale applications and prediction accuracy.

    • Statistical learning emphasizes models and their interpretability, and precision and uncertainty.

  • But the distinction has become more and more blurred, and there is a great deal of “cross-fertilization”.

  • Machine learning has the upper hand in Marketing!

1.6 A Brief History of Statistical Learning

Image source: https://people.idsia.ch/~juergen/deep-learning-history.html

  • 1676, chain rule by Leibniz.

  • 1805, least squares / linear regression / shallow learning by Gauss.

  • 1936, classification by linear discriminant analysis by Fisher.

  • 1940s, logistic regression.

  • Early 1970s, generalized linear models (GLMs).

  • Mid 1980s, classification and regression trees.

  • 1980s, generalized additive models (GAMs).

  • 1980s, neural networks gained popularity.

  • 1990s, support vector machines.

  • 2010s, deep learning.

2 Course logistics

2.1 Learning objectives

  1. Understand what machine learning is (and isn’t).

  2. Learn some foundational methods/tools.

  3. For specific data problems, be able to choose methods that make sense.

Tip

Q: Wait, Dr. Zhou! Why don’t we just learn the best method (aka deep learning) first?

A: No single method dominates. One method may prove useful in answering some questions on a given data set. On a related (not identical) data set or question, another might prevail. Article, Article

2.2 Syllabus

  • Read syllabus and schedule for a tentative list of topics and course logistics.

  • Homework assignments will be a mix of theoretical/conceptual and applied/computational questions. Although not required, you are highly encouraged to practice literate programming (using Jupyter, Quarto, RMarkdown, or Google Colab) coordinated through Git/GitHub. This will enhance your GitHub profile and make you more appealing on the job market.

  • I do not, however, require homework submission through Git/GitHub. Homework is submitted through BruinLearn.

  • We will mainly use R in this course.

2.3 What I expect from you

  • You are curious and are excited about “figuring stuff out”.

  • You are proficient in coding and debugging (or are ready to work to get there).

  • You have a solid foundation in introductory statistics (or are ready to work to get there).

  • You are willing to ask questions.

2.4 What you can expect from me

  • I value your learning experience and process.

  • I’m flexible with respect to the topics we cover.

  • I’m happy to share my professional connections.

  • I’ll try my best to be responsive in class, in office hours, and other professional encounters.

3 Notation and Simple Matrix Algebra

3.1 Notation

  • We will use \(n\) to represent the number of distinct data points, or observations, in our sample.

  • We will let \(p\) denote the number of variables that are available for use in making predictions.

    • For example, the Wage data set consists of 11 variables for 3,000 people, so we have \(n = 3,000\) observations and \(p = 11\) variables (such as year, age, race, and more).
    • \(p\) can be quite large, such as on the order of thousands or even millions, e.g., modern biological data, like gene expression, DNA sequences along the genome.
  • We will let \(x_{ij}\) represent the value of the \(j\)th variable for the \(i\)th observation, where \(i = 1,2,\ldots,n\) and \(j = 1,2,\ldots,p\).

  • We will let \(i\) be the index of the samples or observations (from 1 to \(n\)) and \(j\) will be used to index the variables (from 1 to \(p\)).

  • We let \(\mathbf{X}\) denote an \(n \times p\) matrix whose \((i, j )\)th element is \(x_{ij}\) \[ \mathbf{X} = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} \]

  • At times we will be interested in the rows of \(\mathbf{X}\), which we write as \(x_1, x_2, \ldots , x_n\). Here \(x_i\) is a vector of length \(p\), containing the \(p\) variable measurements for the \(i\)th observation. That is, \[ x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{ip} \end{pmatrix}. \] Note: vectors are by default represented as columns.

  • At other times we will instead be interested in the columns of \(\mathbf{X}\), which we write as \(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_p\). Each is a vector of length \(n\). That is, \[ \mathbf{x}_j = \begin{pmatrix} x_{1j} \\ \vdots \\ x_{nj} \end{pmatrix}. \]

  • Using this notation, the matrix \(\mathbf{X}\) can be written as \[ \mathbf{X} = (\mathbf{x}_1 \quad \mathbf{x}_2 \quad \ldots \quad \mathbf{x}_p), \] or \[ \mathbf{X} = \begin{pmatrix} x_{1}^T \\ \vdots \\ x_{n}^T \end{pmatrix}. \]

  • We use \(y_i\) to denote the \(i\)th observation of the variable on which we wish to make predictions (i.e., “outcome”), such as wage. Hence, we write the set of all \(n\) observations in vector form as \[ \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \] Then our observed data consists of \({(x_1, y_1), (x_2, y_2), \ldots , (x_n, y_n)}\), where each \(x_i\) is a vector of length \(p\). (If \(p = 1\), then \(x_i\) is simply a scalar.)

3.2 Matrix Algebra

  • Matrices will be denoted using bold capitals, such as \(\mathbf{A}\).
  • To indicate that an object is a scalar, we will use the notation \(a \in R\).
  • To indicate that it is a vector of length \(k\),we will use \(\mathbf{a}\in R^k\) (or \(\mathbf{a}\in R^n\) if it is of length \(n\)).
  • We will indicate that an object is an \(r \times s\) matrix using \(\mathbf{A} \in R^{r\times s}\).

3.2.1 Special cases of matrices

  • A column vector is a matrix with only one column, e.g. \[ \mathbf{A} = \left(\begin{array}{c} 1 \\ 4 \\ 0\\ -2\\ \end{array}\right) \]

  • A row vector is a matrix with only one row, e.g. \[ \mathbf{A} = \left(\begin{array}{cccc} 1 & 4 & 0 & -2\\ \end{array}\right) \]

  • A matrix with \(r = s\), that is, with the same number of rows and columns, is called a square matrix. If a matrix is square, the elements \(a_{ii}\) are said to lie on the diagonal of \(\mathbf{A}\). \[ \mathbf{A} = \left(\begin{array}{cc} 1 & 4 \\ 0 & -2 \end{array}\right) \]

  • A square matrix is called symmetric if \(a_{ij} = a_{ji}\) for all values of \(i\) and \(j\). \[ \mathbf{A} = \left(\begin{array}{ccc} 3 &5& 7 \\ 5 &1& 4 \\ 7 &4 &8 \end{array}\right) \] Symmetric matrices turn out to be quite important in formulating statistical models for all types of data!

  • An important special case of a square, symmetric matrix is the identity matrix, i.e., a square matrix with \(1\)s on diagonal, \(0\)s elsewhere, e.g. \[ \mathbf{A} = \left(\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0& 0& 1\\ \end{array}\right) \] The identity matrix functions the same way as “\(1\)” does in the real number system.
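These identity-matrix facts are easy to verify numerically, e.g. with NumPy:

```python
import numpy as np

A = np.array([[3., 5., 7.],
              [5., 1., 4.],
              [7., 4., 8.]])  # the symmetric matrix from above
I = np.eye(3)                 # 3 x 3 identity matrix

# Multiplying by the identity leaves A unchanged, on either side,
# just as multiplying a real number by 1 does.
assert np.array_equal(I @ A, A)
assert np.array_equal(A @ I, A)
print(np.diag(A))  # the diagonal elements a_ii: [3. 1. 8.]
```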

3.2.2 Matrix operations

3.2.2.1 Transpose

  • The \(^T\) notation denotes the transpose of a matrix or vector

\[ \mathbf{X}^T = \begin{pmatrix} x_{11} & \cdots & x_{n1} \\ \vdots & \ddots & \vdots \\ x_{1p} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{pmatrix} \] So the transpose of an \(n\times p\) matrix is a \(p\times n\) matrix. That is, the transpose of \(\mathbf{A}\) is the matrix found by “flipping” \(\mathbf{A}\) around its diagonal.

  • For example, \[ \mathbf{A} = \left( \begin{array}{ccc} 1 & 2 & 3 \\ 4 & 5 & 6 \\ \end{array} \right) \quad \mathbf{A}^T = \left( \begin{array}{cc} 1 & 4 \\ 2 & 5 \\ 3 & 6 \\ \end{array} \right) \] A fundamental property of a symmetric matrix is that the matrix and its transpose are the same; i.e., if \(\mathbf{A}\) is symmetric then \(\mathbf{A} = \mathbf{A}^T\). (Try it on the symmetric matrix above.)
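A quick pure-Python sketch of the two facts above: `zip(*A)` flips a list-of-rows matrix around, and a symmetric matrix equals its own transpose.

```python
# Transpose the 2 x 3 example matrix: zip(*A) pairs up the columns of A.
A = [[1, 2, 3],
     [4, 5, 6]]
A_T = [list(col) for col in zip(*A)]
print(A_T)  # [[1, 4], [2, 5], [3, 6]]

# A symmetric matrix is unchanged by transposition.
S = [[3, 5, 7],
     [5, 1, 4],
     [7, 4, 8]]
print(S == [list(col) for col in zip(*S)])  # True
```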

3.2.2.2 Matrix Addition and Subtraction

Adding or subtracting two matrices is defined element-by-element. That is, to add two matrices, add their corresponding elements, e.g. \[ \mathbf{A} = \left( \begin{array}{cc} 1 & 2 \\ 4 & 5 \\ \end{array} \right) \quad \mathbf{B} = \left( \begin{array}{cc} 6 & 4 \\ 2 & -1 \\ \end{array} \right) \] Then, \[ \mathbf{A} + \mathbf{B} = \left( \begin{array}{cc} 7 & 6 \\ 6 & 4 \\ \end{array} \right) \quad \mathbf{A} - \mathbf{B} = \left( \begin{array}{cc} -5 &-2 \\ 2 & 6 \\ \end{array} \right) \] Note that these operations only make sense if the two matrices have the same dimension; the operations are not defined otherwise.
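The element-by-element rule, including the dimension check, can be sketched in a few lines of pure Python (matrices as lists of rows):

```python
def add(A, B):
    """Element-by-element sum; only defined when dimensions match."""
    assert len(A) == len(B) and len(A[0]) == len(B[0]), "dimensions must match"
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    """Element-by-element difference; only defined when dimensions match."""
    assert len(A) == len(B) and len(A[0]) == len(B[0]), "dimensions must match"
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1, 2], [4, 5]]
B = [[6, 4], [2, -1]]
print(add(A, B))  # [[7, 6], [6, 4]]
print(sub(A, B))  # [[-5, -2], [2, 6]]
```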

3.2.2.3 Matrix Multiplication

  • The effect of multiplying a matrix \(\mathbf{A}\) with any dimension by a real number (scalar) \(b\), say, is to multiply each element in \(\mathbf{A}\) by \(b\). \[ 3\left( \begin{array}{cc} 1 & 2 \\ 4 & 5 \\ \end{array} \right) = \left( \begin{array}{cc} 3 & 6 \\ 12 & 15 \\ \end{array} \right) \]

  • General rules,

    • \(\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}\), \(b(\mathbf{A} + \mathbf{B})=b\mathbf{A} + b\mathbf{B}\)
    • \((\mathbf{A} + \mathbf{B})^T=\mathbf{A}^T + \mathbf{B}^T\), \((b\mathbf{A})^T=b\mathbf{A}^T\)
  • Order matters

    • Number of columns of first matrix must = Number of rows of second matrix, e.g., \[ \mathbf{A} = \left( \begin{array}{ccc} 1 & 2 &5 \\ 4 & 5 &1 \\ \end{array} \right) \quad \mathbf{B} = \left( \begin{array}{cc} 3 & 6 \\ 2 & 5 \\ 1 & 2 \\ \end{array} \right)\\ \quad \mathbf{C} = (c_{ij}) = \mathbf{AB} = \left( \begin{array}{cc} 12 & 26 \\ 23 & 51 \\ \end{array} \right) \]
  • Formally, if \(\mathbf{A}\) is \((r\times s)\) and \(\mathbf{B}\) is \((s\times q)\), then \(\mathbf{AB}\) is an \((r\times q)\) matrix with \((i,j)\)th element \[ \sum_{k=1}^s a_{ik}b_{kj}. \]

  • General rules,

    • \(\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}\), \((\mathbf{A}+\mathbf{B}) \mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}\)
    • For any matrix \(\mathbf{A}\), \(\mathbf{A}^T\mathbf{A}\) will be a square matrix.
    • The transpose of a matrix product: \((\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T\).
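The product formula \(c_{ij} = \sum_{k=1}^s a_{ik}b_{kj}\) and the transpose rule \((\mathbf{AB})^T=\mathbf{B}^T\mathbf{A}^T\) can be verified directly on the example above with a short pure-Python sketch:

```python
def matmul(A, B):
    """C = AB via c_ij = sum_k a_ik * b_kj; requires cols(A) == rows(B)."""
    assert len(A[0]) == len(B), "columns of A must equal rows of B"
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

A = [[1, 2, 5],
     [4, 5, 1]]          # 2 x 3
B = [[3, 6],
     [2, 5],
     [1, 2]]             # 3 x 2

C = matmul(A, B)         # 2 x 2
print(C)                                              # [[12, 26], [23, 51]]
print(transpose(C) == matmul(transpose(B), transpose(A)))  # True: (AB)^T = B^T A^T
```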

3.2.3 Example

  • Consider a prediction model, e.g., the wage data example: suppose that we have \(n\) pairs \((x_1,Y_1),\ldots,(x_n,Y_n)\), and we believe that, except for a random deviation, the relationship between the predictor \(x\) (e.g., age) and the response \(Y\) follows a straight line. That is, for \(j=1,\ldots,n\), we have \[ Y_j = \beta_0 + \beta_1x_j + \epsilon_j, \] where \(\epsilon_j\) is a random deviation representing the amount by which the actual observed response \(Y_j\) deviates from the exact straight-line relationship. Defining \[ \mathbf{X}= \left( \begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots\\ 1&x_n\\ \end{array} \right),\quad \mathbf{Y}= \left( \begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_n\\ \end{array} \right),\quad \epsilon= \left( \begin{array}{c} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n\\ \end{array} \right),\quad \beta= \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \end{array} \right), \] we may express the model succinctly as \[ \mathbf{Y}=\mathbf{X}\beta +\epsilon. \]
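To make the matrix form concrete, here is a small pure-Python sketch that assembles the design matrix \(\mathbf{X}\) and evaluates \(\mathbf{Y}=\mathbf{X}\beta+\epsilon\). The numeric values of \(x\), \(\beta\), and \(\epsilon\) are made up for illustration (and \(\epsilon\) is fixed rather than random, so the result is reproducible):

```python
# Hypothetical values for illustration only.
x = [1.0, 2.0, 3.0, 4.0]        # predictor (e.g., age), made up
beta = [0.5, 2.0]               # (beta_0, beta_1), made up
eps = [0.1, -0.2, 0.0, 0.3]     # "random" deviations, fixed here for clarity

# Design matrix: each row of X is (1, x_j), matching the definition above.
X = [[1.0, xj] for xj in x]

# Y = X beta + eps, computed row by row: Y_j = beta_0 + beta_1 * x_j + eps_j.
Y = [row[0] * beta[0] + row[1] * beta[1] + e for row, e in zip(X, eps)]
print(Y)
```

Each entry of `Y` is \(0.5 + 2x_j + \epsilon_j\), so the straight-line model and its matrix form give identical numbers.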